[Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models#34695
MatthewBonanni merged 3 commits into vllm-project:main from
Conversation
…-project#34561) Fix AttributeError when using AWQ/GPTQ quantized MLA models (e.g., GLM-4.7-Flash-AWQ) by guarding `kv_b_proj.weight.dtype` accesses with `hasattr` checks and falling back to `params_dtype`. Signed-off-by: haosdent <haosdent@gmail.com>
Code Review
This pull request effectively addresses a crash that occurs when running MLA models with AWQ/GPTQ quantization. The root cause, an AttributeError from accessing a non-existent .weight attribute on quantized layers, is correctly identified. The fix is clean and robust, using hasattr guards to prevent the error. The fallback to params_dtype for quantized layers in _compute_prefill_context is a logical and well-justified approach. The changes are minimal, targeted, and well-documented in the pull request description. Overall, this is an excellent bugfix.
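The failure mode the review describes can be reproduced with a small stand-in class (hypothetical, not vLLM's actual quantized layer): AWQ/GPTQ layers store packed int32 weights under `qweight`, so any access to `.weight` raises.

```python
# Minimal reproduction sketch (hypothetical GPTQLinear stand-in):
# AWQ/GPTQ layers expose packed weights as `qweight`, not `weight`,
# so `.weight.dtype` raises AttributeError.
class GPTQLinear:
    qweight = "packed int32 data"  # no .weight attribute

layer = GPTQLinear()
try:
    _ = layer.weight.dtype
except AttributeError as e:
    print(e)  # 'GPTQLinear' object has no attribute 'weight'
```

This is why the fix guards each access with `hasattr` rather than assuming `.weight` exists.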
cc @LucasWilkinson @MatthewBonanni @pavanimajety to review
I've checked that with this PR the deepseek v3 AWQ checkpoint loads successfully and gets normal accuracy (gsm8k: 0.945). The change looks accurate and simple; can we review this PR to unblock AWQ/GPTQ models from running on latest main with transformers v5?
Thanks for @cjackal's test. @LucasWilkinson @MatthewBonanni @pavanimajety, can you help review? Thank you in advance.
I also verified that this patch fixes the error when serving GLM-4.7-Flash-GPTQ-4bits, with both single and batch requests.
MatthewBonanni left a comment
LGTM, thanks for the contribution!
…-project#34695) Signed-off-by: haosdent <haosdent@gmail.com>
Restore three upstream changes in MLACommonImpl that were accidentally removed in initial AITER commits:

1. Add back `logger.info_once` for backend selection (TRT-LLM, FlashInfer, CUDNN, FlashAttention), helpful for debugging
2. Restore FA4 support in `_pad_v` logic: FA4 natively handles different head dimensions, like FA3 on Hopper
3. Restore `params_dtype` fallback for AWQ/GPTQ quantized models (PR vllm-project#34695): quantized layers may lack the `.weight` attribute

These changes are in MLACommonImpl (the shared backend selector), not related to the AITER fused kernel functionality, which is in the MLAAttention class. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
…-project#34695) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…-project#34695) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…-project#34695) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
Fix `AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'` when running MLA models with AWQ/GPTQ quantization (e.g., cyankiwi/GLM-4.7-Flash-AWQ-4bit). Closes #34561.
Root cause: MLA attention code accesses `self.kv_b_proj.weight.dtype` in 3 places, but AWQ/GPTQ-quantized `ColumnParallelLinear` layers store weights as `qweight` (packed int32), not `weight`. The code only accounted for unquantized and FP8-quantized weights.

Fix: Guard `.weight.dtype` accesses with `hasattr(self.kv_b_proj, "weight")` checks:

- `MLAAttention.__init__`: Added a `hasattr` guard in the `and` chain for the ROCm fp4 BMM check. It short-circuits to `False` for AWQ/GPTQ, which is correct since packed int32 weights can't be used with fp4 BMM.
- `MLACommonImpl._compute_prefill_context`: Introduced a local `_kv_b_proj_w_dtype` that uses `weight.dtype` when available, falling back to `params_dtype` (always present on `LinearBase`) for quantized layers. `params_dtype` is the model's compute dtype (e.g., bf16), which is the correct input dtype that AWQ/GPTQ layers expect.

Correctness verified across all quantization methods:
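The two guards can be sketched as follows, with hypothetical stand-in classes (not vLLM's actual `LinearBase`/`ColumnParallelLinear`; dtypes are modeled as strings for brevity):

```python
# Sketch of the guard patterns described above (hypothetical stand-ins).
class UnquantizedLinear:
    """Layer with a real .weight tensor; dtype modeled as a string."""
    params_dtype = "bfloat16"
    weight = type("Tensor", (), {"dtype": "float16"})()

class AWQLinear:
    """AWQ/GPTQ-style layer: packed qweight only, no .weight."""
    params_dtype = "bfloat16"
    qweight = "packed int32 data"

def kv_b_proj_w_dtype(layer):
    # Fallback pattern: prefer weight.dtype when present, otherwise use
    # params_dtype, which is always set and matches the compute dtype.
    return layer.weight.dtype if hasattr(layer, "weight") else layer.params_dtype

def fp4_bmm_eligible(layer):
    # Short-circuit pattern: hasattr runs first in the `and` chain, so a
    # quantized layer yields False instead of raising AttributeError.
    return hasattr(layer, "weight") and layer.weight.dtype == "fp4"

print(kv_b_proj_w_dtype(UnquantizedLinear()))  # float16
print(kv_b_proj_w_dtype(AWQLinear()))          # bfloat16
print(fp4_bmm_eligible(AWQLinear()))           # False
```

Both paths avoid touching `.weight` unless it exists, which is the whole of the fix.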
Test Plan

Tests cover the `hasattr` guards and 1 `params_dtype` fallback, mocking `ColumnParallelLinear` layer behavior.

Test Result